ABSTRACT

A popular tool for unsupervised modelling and mining of multi-aspect
data is tensor decomposition. In an exploratory setting, where no
labels or ground truth are available, how can we automatically
decide how many components to extract? How can we assess the
quality of our results, so that a domain expert can factor this
quality measure into the interpretation of our results? In this
paper, we introduce AutoTen, a novel automatic unsupervised tensor
mining algorithm with minimal user intervention, which leverages and
improves upon heuristics that assess the result quality. We
extensively evaluate AutoTen’s performance on synthetic data,
outperforming existing baselines on this very hard problem. Finally,
we apply AutoTen to a variety of real datasets, providing insights
and discoveries. We view this work as a step towards a fully
automated, unsupervised tensor mining tool that can be easily adopted
by practitioners in academia and industry.


One challenge, which has received considerable attention, is that of
making tensor decompositions scalable to today’s web scale. For
instance, Facebook has around 2 billion users at the time of writing
and is still growing; making tensor decompositions work on even
small portions of the entire Facebook network is imperative for the
adoption of these techniques by such big players. Very frequently,
data in this category turn out to be highly sparse; the reason is
that, e.g., each person on Facebook interacts with only a few
hundred other users. Computing tensor decompositions for highly
sparse scenarios is a game changer, and exploiting sparsity is key
to scalability. The work of Kolda et al. [?] introduced the first
such approach of exploiting sparsity for scalability. Later on,
distributed approaches based on the latter formulation [?], or other
scalable approaches [?], have emerged. By no means do we claim that
scalability is a solved problem; however, we point out that it has
received significant attention.
1. INTRODUCTION

Tensor decompositions and their applications in mining multi-aspect
datasets are ubiquitous and ever increasing in popularity. Data
mining applications of these techniques were largely pioneered by
the work of Kolda et al. [1], where the authors introduce a topical
aspect to a graph between webpages and extend the popular HITS
algorithm to that scenario. Since then, the field of
multi-aspect/tensor data mining has witnessed rich growth, with
prime examples of applications being citation networks [2], computer
networks [2, 3, 4], Knowledge Base data [?], and social networks
[?], to name a few.

Tensor decompositions are undoubtedly a very powerful analytical
tool with a rich variety of applications. However, there exist
research challenges in the field of data mining that need to be
addressed in order for tensor decompositions to claim their position
as a de facto tool for practitioners.



Figure 1: Starting from an unsupervised, exploratory application,
AutoTen automatically determines a solution with high quality,
outperforming existing baselines, and enables discoveries in real data.

The main focus of this work, however, is on another, relatively
less explored territory: that of assessing the quality of a tensor
decomposition. In a great portion of tensor data mining, the task is
exploratory and unsupervised: we are given a dataset, usually
without any sort of ground truth, and we seek to extract interesting
patterns or concepts from the data. It is crucial, therefore, to know
whether a pattern that we extract actually models the data at hand,
or whether it is merely modelling noise in the data. Especially in
the age of Big Data, where feature spaces can be vast, it is
imperative to have a measure of quality and avoid interpreting noisy,
random variation that always exists in the data. Determining the
“right” number of components in a tensor is a very hard problem
[?]. This is why many seminal exploratory tensor mining papers,
understandably, set the number of components manually [?]. When
there is a specific task at hand, e.g. link prediction [?],
recommendation [19], and supervised learning [20, 21], that entails
some measure of success, there is some procedure (e.g.
cross-validation) for selecting a good number of latent components,
which unfortunately cannot generalize to the case where labels or
ground truth are absent.

However, not all hope is lost. There have been very recent
approaches following the Minimum Description Length (MDL) principle
[22, 23], where the MDL cost function usually depends heavily on the
application at hand (e.g. community detection or boolean tensor
clustering respectively). Additionally, there have been Bayesian
approaches [24] that, as in the MDL case, do not require the number
of components as input. These approaches are extremely interesting,
and we reserve their deeper investigation for future work. In this
work, however, we choose to operate on top of a different, very
intuitive approach which takes into account properties of the
PARAFAC decomposition [25] and is application independent, requiring
no prior knowledge about the data: there exists highly influential
work in the Chemometrics literature [26] that introduces heuristics
for determining a good rank for tensor decompositions. Inspired by
and drawing from [26], we provide a comprehensive method for mining
multi-aspect datasets using tensor decompositions.

Our contributions are: 

• Algorithms: We propose AutoTen, a comprehensive methodology
for mining multi-aspect datasets using tensors, which
minimizes manual trial-and-error intervention and provides
quality characterization of the solution (Section 3.2).
Furthermore, we extend the quality assessment heuristic of [26]
assuming KL-divergence, which has been shown to be more
effective in highly sparse, count data [27] (Section 3.1).

• Evaluation & Discovery: We conduct a large scale study on
10 real datasets, exploring the structure of hidden patterns
within these datasets. To the best of our knowledge, this is
the first such broad study (Section 5.1). As a data mining
case study, we apply AutoTen to two real datasets, discovering
meaningful patterns (Section 5.2). Finally, we extensively
evaluate our proposed method on synthetic data (Section 4).

In order to encourage reproducibility, most of the datasets used are 
public, and we make our code publicly available at 

http://www.cs.emu.edu/~epapalex/src/AutoTen.zip 

2. BACKGROUND 

Table 1 provides an overview of the notation used in this and
subsequent sections.

2.1 Brief Introduction to Tensor Decompositions

Table 1: Table of symbols

Symbol                                                 Definition
$\underline{\mathbf{X}}, \mathbf{X}, \mathbf{x}, x$    Tensor, matrix, column vector, scalar
$\circ$                                                outer product
$\mathrm{vec}(\cdot)$                                  vectorization operator
$\otimes$                                              Kronecker product
$\ast, \oslash$                                        element-wise multiplication and division
$\mathbf{A}^{\dagger}$                                 Moore-Penrose pseudoinverse of A
$D_{KL}(a \| b)$                                       KL-Divergence
$\|\mathbf{A}\|_F$                                     Frobenius norm
KronMatVec                                             efficient computation of
                                                       $\mathbf{y} = (\mathbf{A}_1 \otimes \mathbf{A}_2 \otimes \cdots \otimes \mathbf{A}_n)\,\mathbf{x}$ [28]
$\mathbf{x}(i)$                                        i-th entry of x (same for matrices and tensors)
$\mathbf{X}(:, i)$                                     spans the entire i-th column of X (same for tensors)
$\mathbf{x}^{(k)}$                                     value at the k-th iteration
CP_NMU                                                 non-negative, Frobenius norm PARAFAC [29]
CP_APR                                                 KL-Divergence PARAFAC [27]

Given a tensor X, we can decompose it according to the CP/PARAFAC
decomposition [25] (henceforth referred to as PARAFAC) as a sum
of rank-one tensors:

$$\underline{\mathbf{X}} \approx \sum_{f=1}^{F} \mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f$$

where the $(i, j, k)$ entry of $\mathbf{a}_f \circ \mathbf{b}_f \circ \mathbf{c}_f$
is $\mathbf{a}_f(i)\,\mathbf{b}_f(j)\,\mathbf{c}_f(k)$. Usually, PARAFAC is
represented in its matrix form [A, B, C], where the columns of
matrix A are the $\mathbf{a}_f$ vectors (and accordingly for B, C).
The PARAFAC decomposition is especially useful when we are
interested in extracting the true latent factors that generate
the tensor. In this work, we choose the PARAFAC decomposition
as our tool, since it admits a very intuitive interpretation of its
latent factors; each component can be seen as a soft co-clustering of
the tensor, using the high values of vectors
$\mathbf{a}_f, \mathbf{b}_f, \mathbf{c}_f$ as the membership values
to co-clusters.
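As a small illustration of the PARAFAC model above, the following sketch rebuilds a three-mode tensor from its factor matrices with NumPy. The function name `parafac_reconstruct` and the toy dimensions are assumptions made for this example only; they do not come from the AutoTen code.

```python
import numpy as np

def parafac_reconstruct(A, B, C):
    """Sum of F rank-one tensors a_f o b_f o c_f, returned as an I x J x K array."""
    # Entry (i, j, k) is sum_f A[i, f] * B[j, f] * C[k, f].
    return np.einsum("if,jf,kf->ijk", A, B, C)

# Toy factors with F = 2 components (illustrative values only).
rng = np.random.default_rng(0)
A, B, C = rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))
X = parafac_reconstruct(A, B, C)
```

Each column triple (A[:, f], B[:, f], C[:, f]) contributes one rank-one term, which is what makes the soft co-clustering reading of the factors possible.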

Another very popular tensor decomposition is Tucker3 [30], where
a tensor is decomposed into rank-one factors times a core tensor:

$$\underline{\mathbf{X}} \approx \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} \underline{\mathbf{G}}(p, q, r)\, \mathbf{u}_p \circ \mathbf{v}_q \circ \mathbf{w}_r$$

where U, V, W are orthogonal. The Tucker3 model is especially
used for compression. Furthermore, PARAFAC can be seen as a
restricted Tucker3 model, where the core tensor G is super-diagonal,
i.e. non-zero values are only in the entries where i = j = k. This
observation will be useful in order to motivate the CORCONDIA
diagnostic.
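The relation between the two models can be checked numerically. The sketch below evaluates the Tucker3 sum with a super-diagonal core of ones and compares it with the PARAFAC sum built from the same factor matrices. The helper name `tucker3_reconstruct` is hypothetical, and the factors are random (not orthogonal), which is enough to illustrate the restriction.

```python
import numpy as np

def tucker3_reconstruct(G, U, V, W):
    """Evaluate sum_{p,q,r} G(p,q,r) u_p o v_q o w_r as an I x J x K array."""
    return np.einsum("pqr,ip,jq,kr->ijk", G, U, V, W)

F = 3
rng = np.random.default_rng(1)
A, B, C = rng.random((4, F)), rng.random((5, F)), rng.random((6, F))

# Super-diagonal core: ones on the (f, f, f) entries, zeros elsewhere.
G_diag = np.zeros((F, F, F))
idx = np.arange(F)
G_diag[idx, idx, idx] = 1.0

X_tucker = tucker3_reconstruct(G_diag, A, B, C)
X_parafac = np.einsum("if,jf,kf->ijk", A, B, C)
```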

Finally, there also exist more expressive, but harder to interpret
models, such as the Block Term Decomposition (BTD) [31]; however,
we leave their investigation for future work.

2.2 Brief Introduction to CORCONDIA 

As outlined in the Introduction, in the chemometrics literature,
there exists a very intuitive heuristic by the name of CORCONDIA
[32], which can serve as a guide in judging how well a given
PARAFAC decomposition is modelling a tensor.

In a nutshell, the idea behind CORCONDIA is the following: 
Given a tensor X and its PARAFAC decomposition A, B, C, one 
could imagine fitting a Tucker3 model where matrices A, B, C are 
the factors of the Tucker3 decomposition and G is the core tensor 
(which we need to solve for). Since, as we already mentioned, 
PARAFAC can be seen as a restricted Tucker3 decomposition with 
super-diagonal core tensor, if our PARAFAC modelling of X using 
A, B, C is modelling the data well, the core tensor G should be as 
close to super-diagonal as possible. If there are deviations from the 
super-diagonal, then this is a good indication that our PARAFAC 
model is somehow flawed (either the decomposition rank is not 
appropriate, or the data do not have the appropriate structure). We 
can pose the problem as the following least squares problem: 

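$$\min_{\underline{\mathbf{G}}} \left\| \mathrm{vec}(\underline{\mathbf{X}}) - (\mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C})\, \mathrm{vec}(\underline{\mathbf{G}}) \right\|_F^2$$

with the least squares solution:
$\mathrm{vec}(\underline{\mathbf{G}}) = (\mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C})^{\dagger}\, \mathrm{vec}(\underline{\mathbf{X}})$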

After computing G, the CORCONDIA diagnostic can be computed as

$$c = 100 \left( 1 - \frac{\sum_{i=1}^{F} \sum_{j=1}^{F} \sum_{k=1}^{F} \left( \underline{\mathbf{G}}(i, j, k) - \underline{\mathbf{I}}(i, j, k) \right)^2}{F} \right)$$

where I is a super-diagonal tensor with ones on the (i, i, i) entries.
For a perfectly super-diagonal G (i.e. perfect modelling), c will be
100. One can see that for rank-one models, the metric will always
be 100, because the rank-one component can trivially produce a
single-element “super-diagonal” core; thus, CORCONDIA is
applicable for rank two or higher. According to [32], values below 50
show some imperfection in the modelling or the rank selection; the
value can also be negative, showing severe problems with the
modelling. In [32], some of the chemical data analyzed have perfect,
low rank PARAFAC structure, and thus expecting c > 50 is
reasonable. In many data mining applications, however, due to high
data sparsity, data cannot have such perfect structure, but an
approximation thereof using a low rank model is still very valuable.
Thus, in our case, we expand the spectrum of acceptable solutions
with reasonable quality to include smaller, positive values of c (e.g.
20 or higher).
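For small, dense tensors the diagnostic can be computed directly from its definition. The following sketch is a naive reference version that materializes the Kronecker product and pseudoinverts it, which is exactly what the scalable variants discussed next avoid. The function name `corcondia_dense` is an assumption of this example, and the row-major vectorization used by NumPy is paired with the Kronecker order A, B, C.

```python
import numpy as np

def corcondia_dense(X, A, B, C):
    """Naive CORCONDIA for a small dense I x J x K tensor and F-column factors."""
    F = A.shape[1]
    W = np.kron(np.kron(A, B), C)            # (I*J*K) x F^3, only viable for small data
    g = np.linalg.pinv(W) @ X.reshape(-1)    # least squares core, row-major vec
    G = g.reshape(F, F, F)
    I_sd = np.zeros((F, F, F))               # super-diagonal tensor of ones
    idx = np.arange(F)
    I_sd[idx, idx, idx] = 1.0
    return 100.0 * (1.0 - np.sum((G - I_sd) ** 2) / F)

# A tensor with exact rank-2 PARAFAC structure, scored with its true factors.
rng = np.random.default_rng(0)
A, B, C = rng.random((6, 2)), rng.random((7, 2)), rng.random((8, 2))
X = np.einsum("if,jf,kf->ijk", A, B, C)
c_exact = corcondia_dense(X, A, B, C)
```

With the true factors of an exactly low rank tensor the recovered core is super-diagonal, so the score sits at the upper end of the scale; factors of a wrong rank or noisy data would pull it down.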


2.3 Scaling Up CORCONDIA 

As we mention in the Introduction, CORCONDIA as introduced in [26]
is suitable for small and dense data. However, this contradicts the
area of interest of the vast majority of data mining applications.
To that end, very recently [33] we extended CORCONDIA to the case
where our data are large but sparse, deriving a fast and efficient
algorithm. The key idea behind [33] is to avoid pseudoinverting
$(\mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C})$.

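In order to achieve the above, we need to reformulate the
computation of CORCONDIA. The pseudoinverse
$(\mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C})^{\dagger}$ can
be rewritten as

$$(\mathbf{V}_a \otimes \mathbf{V}_b \otimes \mathbf{V}_c) \left( \mathbf{\Sigma}_a^{-1} \otimes \mathbf{\Sigma}_b^{-1} \otimes \mathbf{\Sigma}_c^{-1} \right) \left( \mathbf{U}_a^T \otimes \mathbf{U}_b^T \otimes \mathbf{U}_c^T \right)$$

where $\mathbf{A} = \mathbf{U}_a \mathbf{\Sigma}_a \mathbf{V}_a^T$,
$\mathbf{B} = \mathbf{U}_b \mathbf{\Sigma}_b \mathbf{V}_b^T$, and
$\mathbf{C} = \mathbf{U}_c \mathbf{\Sigma}_c \mathbf{V}_c^T$
(i.e. the respective Singular Value Decompositions).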

After we rewrite the least squares problem in the above form,
we can carry out a series of Kronecker products times a vector
very efficiently, without materializing the (potentially big)
Kronecker product. In [33] we use the algorithm proposed in [28]
to do this, which we will henceforth refer to as the KronMatVec
operation:

$$\mathbf{y} = (\mathbf{A}_1 \otimes \mathbf{A}_2 \otimes \cdots \otimes \mathbf{A}_n)\, \mathbf{x}$$
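The two ideas above can be sketched together in a few lines of NumPy. This is not the algorithm of [28], only one simple way to obtain the same product: the vector is reshaped into a tensor and each matrix is applied along its own mode, so the Kronecker product is never formed. The names `kron_matvec` and `kron_pinv_matvec` are assumptions of this example, and the second function assumes every factor has full column rank.

```python
import numpy as np

def kron_matvec(mats, x):
    """Compute (M1 (x) M2 (x) ... (x) Mn) x without forming the Kronecker product."""
    T = x.reshape([M.shape[1] for M in mats])
    for axis, M in enumerate(mats):
        # Multiply mode `axis` of T by M, then put the new axis back in place.
        T = np.moveaxis(np.tensordot(M, T, axes=(1, axis)), 0, axis)
    return T.reshape(-1)

def kron_pinv_matvec(mats, x):
    """Apply pinv(M1 (x) ... (x) Mn) to x using one thin SVD per factor."""
    svds = [np.linalg.svd(M, full_matrices=False) for M in mats]
    x = kron_matvec([U.T for U, _, _ in svds], x)
    x = kron_matvec([np.diag(1.0 / s) for _, s, _ in svds], x)  # needs nonzero singular values
    return kron_matvec([Vt.T for _, _, Vt in svds], x)

rng = np.random.default_rng(0)
mats = [rng.random((4, 2)), rng.random((5, 3)), rng.random((6, 2))]
W = np.kron(np.kron(mats[0], mats[1]), mats[2])   # materialized only to check the result
x = rng.random(W.shape[1])
y = rng.random(W.shape[0])
```

On real data the explicit `W` would of course be omitted; it is built here solely so that the structured products can be compared with the dense ones.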

What we have achieved thus far is extending CORCONDIA to
large and sparse data, assuming Frobenius norm. This assumption
postulates that the underlying data distribution is Gaussian.
However, recently [27] showed that for sparse data that capture counts
(e.g. number of messages exchanged), it is more beneficial to
postulate a Poisson distribution, therefore using the KL-Divergence as
a loss function. This has been more recently adopted in [34],
showing very promising results in biomedical applications. Therefore,
one natural direction, which we follow in the first part of the next
section, is to extend CORCONDIA for this scenario.


3. PROPOSED METHODS 

In exploratory data mining applications, the case is very
frequently the following: we are given a piece of (usually very large)
data that is of interest to a domain expert, and we are asked to
identify regular and irregular patterns that are potentially useful to
the expert who is providing the data. During this process, very often,
the analysis is carried out in a completely unsupervised way, since
ground truth and labels are either very expensive or impossible to
obtain. In our context of tensor data mining, here is the problem at
hand:


Informal Problem 1. Given a tensor X without ground truth
or labelled data, how can we analyze it using the PARAFAC
decomposition so that we can also:

1. Determine automatically a good number of components for 
the decomposition 

2. Provide quality guarantees for the resulting decomposition 

3. Minimize human intervention and trial-and-error testing 

In order to attack the above problem, first, in Section 3.1 we
describe how we can derive a fast and efficient metric of the quality
of a decomposition, assuming the KL-Divergence. Finally, in Section 3.2
we introduce AutoTen, our unified algorithm for automatic tensor
mining with minimal user intervention and quality characterization
of the solution.

3.1 Quality Assessment with KL-Divergence 

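As we saw in the description of CORCONDIA with Frobenius
norm loss, its computation requires solving the least squares
problem:

$$\min_{\underline{\mathbf{G}}} \left\| \mathrm{vec}(\underline{\mathbf{X}}) - (\mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C})\, \mathrm{vec}(\underline{\mathbf{G}}) \right\|_F^2$$

In the case of the CP_APR modelling, where the loss function
is the KL-Divergence, the minimization problem that we need to
solve is:

$$\min_{\mathbf{x}} D_{KL}(\mathbf{y} \,\|\, \mathbf{W}\mathbf{x}) \qquad (1)$$

where in our case, $\mathbf{W} = \mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C}$.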

Unlike the Frobenius norm case, where the solution to the
problem is the Least Squares estimate, in the KL-Divergence case the
problem does not have a closed form solution. Instead, iterative
solutions apply. The most prominent approach to this problem
is via an optimization technique called Majorization-Minimization
(MM) or Iterative Majorization [35]. In a nutshell, in MM, given
a function that is hard to minimize directly, we derive a
“majorizing” function, which is always greater than the function to be
minimized, except for a support point where it is equal; we minimize
the majorizing function, and iteratively update the support point
using the minimizer of that function. This procedure converges to
a local minimum. For the problem of Eq. 1, [36] and subsequently
[?] employ the following update rule, which is used iteratively
until convergence to a stationary point:

$$\mathbf{x}(j)^{(k)} = \mathbf{x}(j)^{(k-1)} \, \frac{\sum_i \mathbf{W}(i, j) \left( \dfrac{\mathbf{y}(i)}{\tilde{\mathbf{y}}(i)^{(k-1)}} \right)}{\sum_i \mathbf{W}(i, j)} \qquad (2)$$

where $\tilde{\mathbf{y}}^{(k-1)} = \mathbf{W}\mathbf{x}^{(k-1)}$, and k denotes the k-th iteration index.
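The multiplicative update can be sketched for a small, explicitly stored W as follows. This is a naive reference version, with names (`generalized_kl`, `kl_regression_naive`) chosen for this example only. The sketch assumes non-negative y and W and measures progress with the generalized (unnormalized) KL-divergence, which is an assumption of this example about the exact form of the loss.

```python
import numpy as np

def generalized_kl(y, y_hat, eps=1e-12):
    """D_KL(y || y_hat) for non-negative vectors, in its generalized form."""
    y_hat = np.maximum(y_hat, eps)
    mask = y > 0
    return np.sum(y[mask] * np.log(y[mask] / y_hat[mask])) - y.sum() + y_hat.sum()

def kl_regression_naive(y, W, x0, n_iter=100, eps=1e-12):
    """Multiplicative MM updates with W stored explicitly."""
    x = x0.copy()
    s = W.sum(axis=0)                       # s(j) = sum_i W(i, j)
    history = []
    for _ in range(n_iter):
        y_hat = W @ x                       # current model of y
        history.append(generalized_kl(y, y_hat))
        x = x * (W.T @ (y / np.maximum(y_hat, eps))) / np.maximum(s, eps)
    return x, history

rng = np.random.default_rng(0)
W = rng.random((30, 5))
y = W @ rng.random(5)
x_hat, history = kl_regression_naive(y, W, x0=np.ones(5))
```

The recorded `history` makes the majorization property visible: the loss should never increase from one iteration to the next.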

The above solution is generic for any structure of W. Remember,
however, that W has very specific Kronecker structure which
we should exploit. Additionally, suppose that we have a
$10^4 \times 10^4 \times 10^4$ tensor; then, the large dimension of
W will be $10^{12}$. If we attempt to materialize, store, and use W
throughout the algorithm, that can prove catastrophic to the
algorithm’s performance. We can exploit the Kronecker structure of W
so that we break down Eq. 2 into pieces, each of which can be
computed efficiently, given the structure of W. The first step is to
decompose the expression of the numerator of Eq. 2. In particular, we
equivalently write
$\mathbf{x}^{(k)} = \mathbf{x}^{(k-1)} \ast \mathbf{z}_2$ where
$\mathbf{z}_2 = \mathbf{W}^T \mathbf{z}_1$ and
$\mathbf{z}_1 = \mathbf{y} \oslash \tilde{\mathbf{y}}$.
Due to the Kronecker structure of W:

$$\mathbf{z}_2 = \text{KronMatVec}(\{\mathbf{A}^T, \mathbf{B}^T, \mathbf{C}^T\}, \mathbf{z}_1)$$

Therefore, the update to $\mathbf{x}^{(k)}$ is efficiently calculated
in the three above steps. The normalization factor of the equation is
equal to $\mathbf{s}(j) = \sum_i \mathbf{W}(i, j)$. Given the
Kronecker structure of W, however, the following holds:

CLAIM 1. The row sum of a Kronecker product matrix
$\mathbf{A} \otimes \mathbf{B}$ can be rewritten as
$\left( \sum_{i=1}^{I} \mathbf{A}(i, :) \right) \otimes \left( \sum_{j=1}^{J} \mathbf{B}(j, :) \right)$.

PROOF. We can rewrite the row sums
$\sum_{i=1}^{I} \mathbf{A}(i, :) = \mathbf{i}_I^T \mathbf{A}$ and
$\sum_{j=1}^{J} \mathbf{B}(j, :) = \mathbf{i}_J^T \mathbf{B}$, where
$\mathbf{i}_I$ and $\mathbf{i}_J$ are all-ones column vectors
of size I and J respectively. For the Kronecker product of the row
sums, using properties of the Kronecker product and calling
$\mathbf{A} \otimes \mathbf{B} = \mathbf{W}$, we have

$$(\mathbf{i}_I^T \mathbf{A}) \otimes (\mathbf{i}_J^T \mathbf{B}) = (\mathbf{i}_I \otimes \mathbf{i}_J)^T (\mathbf{A} \otimes \mathbf{B}) = \mathbf{i}_{IJ}^T \mathbf{W} = \sum_{i=1}^{IJ} \mathbf{W}(i, :)$$

which concludes the proof. □
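Claim 1 is easy to check numerically on small random matrices. The sketch below compares the sum over all rows of the Kronecker product with the Kronecker product of the two row sums; the matrix sizes are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.random((3, 2)), rng.random((4, 5))

# Left-hand side: sum over all rows of the materialized Kronecker product.
lhs = np.kron(A, B).sum(axis=0)
# Right-hand side: Kronecker product of the summed rows of A and of B.
rhs = np.kron(A.sum(axis=0), B.sum(axis=0))
```

The right-hand side only touches the small factors, which is why the normalization vector s can be obtained without ever building W.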

Thus, $\mathbf{s} = \left( \sum_i \mathbf{A}(i, :) \right) \otimes \left( \sum_j \mathbf{B}(j, :) \right) \otimes \left( \sum_n \mathbf{C}(n, :) \right)$.
Putting everything together, we end up with Algorithm 2, which
is an efficient solution to the minimization problem of Equation
1. As in the naive case, we also use Iterative Majorization in the
efficient algorithm; we iterate updating $\mathbf{x}^{(k)}$ until we
converge to a local optimum. Finally, Algorithm 1 shows the steps to
compute CORCONDIA under KL-Divergence efficiently.


Algorithm 1: Efficient Quality Assessment with KL-Divergence loss

Input: Tensor X and CP_APR factor matrices A, B, C. 

Output: CORCONDIA diagnostic c. 

1: some more stuff 
2: some stuff here 

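$c = 100 \left( 1 - \frac{\sum_{i=1}^{F} \sum_{j=1}^{F} \sum_{k=1}^{F} \left( \underline{\mathbf{G}}(i, j, k) - \underline{\mathbf{I}}(i, j, k) \right)^2}{F} \right)$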


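Algorithm 2: Efficient Majorization Minimization for KL-Divergence
Regression

Input: Vector y and matrices A, B, C.
Output: Vector x
1: Initialize $\mathbf{x}^{(0)}$ randomly
2: $\tilde{\mathbf{y}}$ = KronMatVec({A, B, C}, $\mathbf{x}^{(0)}$)
3: $\mathbf{s} = \left( \sum_i \mathbf{A}(i, :) \right) \otimes \left( \sum_j \mathbf{B}(j, :) \right) \otimes \left( \sum_n \mathbf{C}(n, :) \right)$
4: Start loop:
5:   $\mathbf{z}_1 = \mathbf{y} \oslash \tilde{\mathbf{y}}$
6:   $\mathbf{z}_2$ = KronMatVec({$\mathbf{A}^T, \mathbf{B}^T, \mathbf{C}^T$}, $\mathbf{z}_1$)
7:   $\mathbf{x}^{(k)} = \mathbf{x}^{(k-1)} \ast \mathbf{z}_2$
8:   $\tilde{\mathbf{y}}$ = KronMatVec({A, B, C}, $\mathbf{x}^{(k)}$)
9: End loop
10: Normalize $\mathbf{x}^{(k)}$ using s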
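A minimal Python sketch of the structured updates is given below; it is an illustration under stated assumptions, not the Matlab implementation used in the paper. The helper `kron_matvec` and the function `kl_regression_kron` are names invented for this example, the data are assumed non-negative, and the division by s is applied inside every iteration, following the update rule of Eq. 2.

```python
import numpy as np
from functools import reduce

def kron_matvec(mats, x):
    """Compute (M1 (x) ... (x) Mn) x without forming the Kronecker product."""
    T = x.reshape([M.shape[1] for M in mats])
    for axis, M in enumerate(mats):
        T = np.moveaxis(np.tensordot(M, T, axes=(1, axis)), 0, axis)
    return T.reshape(-1)

def kl_regression_kron(y, mats, x0, n_iter=100, eps=1e-12):
    """MM updates for min_x D_KL(y || W x) with W = M1 (x) ... (x) Mn kept implicit."""
    x = x0.copy()
    s = reduce(np.kron, [M.sum(axis=0) for M in mats])   # column sums of W via Claim 1
    mats_T = [M.T for M in mats]
    y_hat = kron_matvec(mats, x)
    for _ in range(n_iter):
        z1 = y / np.maximum(y_hat, eps)        # element-wise division
        z2 = kron_matvec(mats_T, z1)           # W^T z1 without materializing W
        x = x * z2 / np.maximum(s, eps)        # multiplicative update with normalization
        y_hat = kron_matvec(mats, x)
    return x

rng = np.random.default_rng(0)
mats = [rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2))]
W = reduce(np.kron, mats)                      # materialized only to check the result
y = W @ rng.random(W.shape[1])
x0 = np.ones(W.shape[1])
x_fast = kl_regression_kron(y, mats, x0)

# Reference: the same updates written with the explicit W.
x_ref = x0.copy()
for _ in range(100):
    x_ref = x_ref * (W.T @ (y / (W @ x_ref))) / W.sum(axis=0)
```

Starting both versions from the same `x0` lets the structured iterates be compared directly with the dense ones on a problem small enough to store W.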


3.2 AutoTen: Automated Unsupervised Tensor Mining

At this stage, we have the tools we need in order to design an
automated tensor mining algorithm that minimizes human intervention
and provides quality characterization of the solution. We
call our proposed method AutoTen, and we view this as a step
towards making tensor mining a fully automated tool, used as a black
box by academic and industrial practitioners.

AutoTen is a two-step algorithm: we first search through
the solution space, and in the second step we automatically select a
good solution based on its quality and the number of components
it offers. A sketch of AutoTen follows, and is also outlined in
Algorithm 3.

Solution Search. 

The user provides a data tensor, as well as a maximum rank
that reflects the budget that she is willing to devote to AutoTen’s
search. We neither have nor require any prior knowledge of whether
the tensor is highly sparse or dense, or whether it contains real
values or counts, which would hint at whether we should use, say,
CP_NMU postulating Frobenius norm loss, or CP_APR postulating
KL-Divergence loss.

Fortunately, our work in this paper, as well as our previous work
[33], has equipped us with tools for handling all of the above cases.
Thus, we follow a data-driven approach, where we let the data show
us whether CP_NMU or CP_APR captures better structure. For a grid
of values for the decomposition rank (bounded by the user-provided
maximum rank), we run both CP_NMU and CP_APR, and we record the
quality of the result as measured by the CORCONDIA diagnostic into
vectors $\mathbf{c}_{Fro}$ and $\mathbf{c}_{KL}$ (using the
algorithm in [33] and Algorithm 1 respectively), truncating negative
values to zero.

Result Selection. 

At this point, for both CP_NMU and CP_APR we have points in
two-dimensional space $(F_i, c_i)$, reflecting the quality and the
corresponding number of components. Informally, our problem here
is the following:

Informal Problem 2. Given points $(F_i, c_i)$ we need to find
one that maximizes the quality of the decomposition, as well as
finding as many hidden components in the data as possible.

Intuitively, we are seeking a decomposition that discovers as many
latent components as possible, without sacrificing the quality of
those components. Essentially, we have a multi-objective
optimization problem, where we need to maximize both $c_i$ and $F_i$.
However, if we, say, get the Pareto front of those points (i.e. the
subset of all non-dominated points), we end up with a family of
solutions without a clear guideline on how to select one. We propose
to use the following effective, two-step maximization algorithm that
gives an intuitive data-driven solution:

• Max c step: Given vector c, run 2-means clustering on its
values. This will essentially divide the vector into a set of
good/high values and a set of low/bad ones. If we call $m_1, m_2$
the means of the two clusters, then we select the cluster index
that corresponds to the maximum between $m_1$ and $m_2$.

• Max F step: Given the cluster of points with maximum mean,
we select the point that maximizes the value of F. We call
this point $(F^*, c^*)$.
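The two steps above can be sketched as follows. The 2-means clustering is written out as a tiny one-dimensional Lloyd iteration, initialized at the smallest and largest value, so that the example has no dependency beyond NumPy; this initialization, the function names, and the toy quality values are all assumptions of the example rather than details taken from the paper.

```python
import numpy as np

def two_means_1d(values, n_iter=100):
    """Two-center k-means on scalars; returns a boolean mask of the high cluster."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if lo == hi:                                  # degenerate case: a single cluster
        return np.ones(values.shape, dtype=bool)
    for _ in range(n_iter):
        high = np.abs(values - hi) <= np.abs(values - lo)
        new_lo, new_hi = values[~high].mean(), values[high].mean()
        if new_lo == lo and new_hi == hi:         # assignments have stabilized
            break
        lo, hi = new_lo, new_hi
    return high

def select_rank(ranks, c):
    """Max c step followed by Max F step; returns (F*, c*)."""
    ranks, c = np.asarray(ranks), np.asarray(c, dtype=float)
    high = two_means_1d(c)                                   # Max c step
    best = np.flatnonzero(high)[np.argmax(ranks[high])]      # Max F step
    return int(ranks[best]), float(c[best])

# Toy diagnostic values for ranks 2..6 (illustrative only).
F_star, c_star = select_rank([2, 3, 4, 5, 6], [90.0, 85.0, 80.0, 10.0, 5.0])
```

In this toy case the high cluster holds the first three ranks, so the selected point is the largest rank among them.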

Another alternative is to formally define a function of c, F that we
wish to maximize, and select the maximum via enumeration. Coming
up with the particular function to maximize, considering the
intuitive objective of maximizing the number of components that
we can extract with reasonably high quality (c), is a hard problem,
and we risk biasing the selection with a specific choice of a
function. Nevertheless, an example of such a function is
$g(c, F) = \log c \, \log F$ for $c > 0$, and $g(0, F) = 0$; this
function essentially measures the area of the rectangle formed by the
lines connecting (F, c) with the axes (in the log-log space) and
intuitively seeks to find a good compromise between maximizing F and
c. This function performs closely to the proposed data-driven
approach and we defer a detailed discussion and investigation to
future work.
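For comparison, selection by enumeration of the example function can be sketched in a few lines. The function name `select_rank_loglog` and the toy values are assumptions of this example; natural logarithms are used, which does not change the maximizer relative to any other base.

```python
import numpy as np

def select_rank_loglog(ranks, c):
    """Pick the point maximizing g(c, F) = log(c) * log(F), with g(0, F) = 0."""
    ranks, c = np.asarray(ranks, dtype=float), np.asarray(c, dtype=float)
    g = np.zeros_like(c)
    pos = c > 0
    g[pos] = np.log(c[pos]) * np.log(ranks[pos])
    best = int(np.argmax(g))
    return int(ranks[best]), float(c[best])

# Toy diagnostic values for ranks 2..6 (illustrative only).
F_alt, c_alt = select_rank_loglog([2, 3, 4, 5, 6], [90.0, 85.0, 80.0, 10.0, 5.0])
```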

After choosing the “best” points $(F^*_{Fro}, c^*_{Fro})$ and
$(F^*_{KL}, c^*_{KL})$, at the final step of AutoTen, we have to
select between the results of CP_NMU and CP_APR. In order to do so,
we can use the following strategies:

1. Calculate $s_{Fro} = \sum_f \mathbf{c}_{Fro}(f)$ and
$s_{KL} = \sum_f \mathbf{c}_{KL}(f)$, and select the method that
gives the largest sum. The intuition behind this data-driven
strategy is choosing the loss function that is able to discover
results with higher quality on aggregate, for more potential ranks.

2. Select the results that produce the maximum value between
$c^*_{Fro}$ and $c^*_{KL}$. This strategy is conservative and aims
for the highest quality of results, possibly at the expense of
components of lesser quality that could still be acceptable for
exploratory analysis.

3. Select the results that produce the maximum value between
$F^*_{Fro}$ and $F^*_{KL}$. Contrary to the previous strategy, this
one is more aggressive, aiming for the highest number of components
that can be extracted with acceptable quality.

Empirically, the last strategy seems to give better results; however,
they all perform very closely on synthetic data. The particular choice
of strategy depends on the application needs, e.g. if it is imperative
that the quality of the components be high, then strategy 2 should be
preferred over strategy 3.
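The three strategies reduce to comparing one score per method, as the following sketch shows. The function name `choose_method`, the tie-breaking rule (ties go to CP_NMU), and the toy numbers are assumptions made for this example.

```python
def choose_method(c_fro, c_kl, best_fro, best_kl, strategy=3):
    """Pick "CP_NMU" or "CP_APR"; best_* are the (F*, c*) pairs of each method."""
    if strategy == 1:        # largest aggregate quality over all ranks
        score_fro, score_kl = sum(c_fro), sum(c_kl)
    elif strategy == 2:      # conservative: highest quality c*
        score_fro, score_kl = best_fro[1], best_kl[1]
    elif strategy == 3:      # aggressive: highest number of components F*
        score_fro, score_kl = best_fro[0], best_kl[0]
    else:
        raise ValueError("strategy must be 1, 2 or 3")
    return "CP_NMU" if score_fro >= score_kl else "CP_APR"   # ties: arbitrary choice

# Toy diagnostics for ranks 2..6 and their selected points (illustrative only).
c_fro = [90.0, 85.0, 80.0, 10.0, 5.0]
c_kl = [95.0, 60.0, 55.0, 50.0, 0.0]
best_fro, best_kl = (4, 80.0), (5, 50.0)
```

On these toy values strategies 1 and 2 favor the Frobenius results while strategy 3 favors the KL-Divergence results, which illustrates how the conservative and aggressive choices can disagree.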


Algorithm 3: AutoTen: Automatic Unsupervised Tensor Mining

Input: Tensor X and maximum budget for component search F
Output: PARAFAC decomposition A, B, C of X and corresponding
quality metric $c^*$.

1: for f = 2 ... F do
2:   Run CP_NMU for f components. Update $\mathbf{c}_{Fro}(f)$ with
     the result of the algorithm in [33].
3:   Run CP_APR for f components. Update $\mathbf{c}_{KL}(f)$ with
     the result of Algorithm 1.
4: end for
5: Find $(F^*_{Fro}, c^*_{Fro})$ and $(F^*_{KL}, c^*_{KL})$ using the
   two-step maximization as described in the text.
6: Choose between CP_NMU and CP_APR using one of the three
   strategies described in the text.
7: Output the chosen $c^*$ and the corresponding decomposition.


We point out that lines 1-5 of Algorithm 3 are embarrassingly
parallel. Finally, it is important to note that AutoTen not only
seeks to find a good number of components for the decomposition,
combining the best of both worlds of CP_NMU and CP_APR, but
furthermore is able to provide quality assessment for the
decomposition: if for a given $F_{max}$ none of the solutions that
AutoTen sifts through yields a satisfactory result, the user will be
able to tell because of the very low (or zero in extreme cases)
$c^*$ value.
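The overall flow of Algorithm 3 can be sketched as a small driver that takes the decomposition and diagnostic routines as arguments. Everything named here is hypothetical: `fit_fro` and `fit_kl` stand in for "run CP_NMU (or CP_APR) for f components and score the result", and the stand-ins used at the bottom simply return canned numbers. The sketch runs the rank grid sequentially, although those iterations could be distributed.

```python
import numpy as np

def high_cluster(c, n_iter=100):
    """Mask of the higher-mean cluster from a two-center 1-D k-means."""
    lo, hi = c.min(), c.max()
    if lo == hi:
        return np.ones(c.shape, dtype=bool)
    for _ in range(n_iter):
        high = np.abs(c - hi) <= np.abs(c - lo)
        new_lo, new_hi = c[~high].mean(), c[high].mean()
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return high

def autoten(X, F_max, fit_fro, fit_kl, strategy=3):
    """fit_*(X, f) must return (factors, c); both callables are hypothetical."""
    ranks = np.arange(2, F_max + 1)
    runs = {}
    for name, fit in (("CP_NMU", fit_fro), ("CP_APR", fit_kl)):
        results = [fit(X, int(f)) for f in ranks]
        c = np.array([max(float(q), 0.0) for _, q in results])   # truncate negatives
        best = np.flatnonzero(high_cluster(c))[-1]   # ranks ascend, so last is max F
        runs[name] = (results[best][0], int(ranks[best]), float(c[best]), float(c.sum()))
    score = {1: 3, 2: 2, 3: 1}[strategy]             # tuple position compared per strategy
    winner = max(runs, key=lambda name: runs[name][score])
    factors, F_star, c_star, _ = runs[winner]
    return winner, factors, F_star, c_star

# Canned diagnostics standing in for real decompositions (illustrative only).
fake_fro = {2: 90.0, 3: 85.0, 4: 80.0, 5: 10.0, 6: -20.0}
fake_kl = {2: 95.0, 3: 60.0, 4: 55.0, 5: 50.0, 6: 0.0}
result = autoten(None, 6,
                 fit_fro=lambda X, f: (("fro", f), fake_fro[f]),
                 fit_kl=lambda X, f: (("kl", f), fake_kl[f]))
```

Passing the fitting routines in as callables keeps the selection logic independent of any particular tensor library.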

4. EXPERIMENTAL EVALUATION 

We implemented AutoTen in Matlab, using the Tensor Toolbox
[29], which provides efficient manipulation and storage of sparse
tensors. We make our code publicly available¹. The online version
of our code contains a test case that uses the same code that we
used for the following evaluation. All experiments were run on a
workstation with 4 Intel(R) Xeon(R) E7-8837 processors and 1TB of RAM.

4.1 Evaluation on Synthetic Data 

In this section, we empirically measure AutoTen’s ability to
uncover the true number of components hidden in a tensor. The
experimental setting is as follows: we create synthetic tensors of
size 50 x 50 x 50, using the function create_problem of the
Tensor Toolbox for Matlab as a standardized means of generating
synthetic tensors. We create two different test cases: 1) sparse
factors, with total number of non-zeros equal to 500, and 2) dense
factors. In both cases, we generate random factors with integer
values. We generate these test cases for true rank $F_o$ ranging
from 2 to 5. For both test cases, we distinguish a noisy case
(where Gaussian noise with variance 0.1 is by default added by
create_problem) and a noiseless case.

¹ Download our code at
http://www.cs.emu.edu/~epapalex/src/AutoTen.zip

We compare AutoTen against three baselines: 

• Baseline 1: A Bayesian tensor decomposition approach, as
introduced very recently in [24], which automatically
determines the rank.

• Baseline 2: This is a very simple heuristic approach where,
for a grid of values for the rank, we run CP_NMU and record
the Frobenius norm loss for each solution. If for two
consecutive iterations the loss does not improve by more than a
small positive number $\epsilon$ (set to $10^{-6}$ here), we declare
as output the result of the previous iteration.

• Baseline 3: Same as Baseline 2, with the sole difference being
that we use CP_APR and accordingly, instead of the Frobenius
norm reconstruction error, we measure the log-likelihood,
and we stop when it stops improving by more than $\epsilon$. We
expect Baseline 3 to be more effective than Baseline 2 on sparse
data, due to the more delicate and effective treatment of sparse,
count data by CP_APR.
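The stopping rule shared by Baselines 2 and 3 can be sketched as follows. The function name `rank_by_loss_plateau` and the toy loss values are assumptions of this example; for Baseline 3 one would pass the negative log-likelihood so that smaller is better in both cases.

```python
def rank_by_loss_plateau(ranks, losses, eps=1e-6):
    """Return the rank preceding the first step whose improvement is at most eps."""
    for i in range(1, len(ranks)):
        if losses[i - 1] - losses[i] <= eps:     # no meaningful improvement
            return ranks[i - 1]
    return ranks[-1]                             # loss kept improving over the whole grid

# Toy losses for ranks 2..6 (illustrative only): the loss flattens after rank 4.
estimated = rank_by_loss_plateau([2, 3, 4, 5, 6], [10.0, 5.0, 1.0, 1.0, 1.0])
```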

AutoTen as well as Baselines 2 & 3 require a maximum bound
$F_{max}$ on the rank; for fairness, we set $F_{max} = 2F_o$ for all
three methods. In Figures 2 and 3 we show the results for both test
cases, for noisy and noiseless data respectively. The error is
measured as $|F_{est} - F_o|$, where $F_{est}$ is the number of
components estimated by each method. Due to the randomized nature of
the synthetic data generation, we ran 1000 iterations and we show the
average results. In the noisy case (Fig. 2) we observe that in both
scenarios and for all chosen ranks, AutoTen outperforms the
baselines, having lower error. In the noiseless case of Fig. 3 we
observe consistent behavior, with all methods experiencing a small
boost to their performance due to the absence of noise. We calculated
statistical significance of our results (p < 0.01) using a two-sided
sign test.

Overall, we conclude that AutoTen largely outperforms the 
baselines. The problem at hand is an extremely hard one, and we 
are not expecting any tractable method to solve it perfectly. Thus, 
the results we obtain here are very encouraging. 



(a) Sparse (b) Dense 

Figure 2: Error for AutoTen and the baselines, for noisy synthetic data. 


5. DATA MINING CASE STUDY 

After establishing that AutoTen is able to perform well in a
controlled, synthetically generated setting, the next step is to see
its results “in the wild”. To that end, we are conducting two case
studies. Section 5.1 takes 10 diverse real datasets shown in Table 2
and investigates their rank structure. In Section 5.2 we apply
AutoTen to two of the datasets of Table 2 and we analyze the results,
as part of an exploratory data mining study.

Table 2: Datasets analyzed

Name                      Description                                 Dimensions               Number of nonzeros
ENRON                     (sender, recipient, month)                  186 x 186 x 44           9838
Reality Mining [37]       (person, person, means of communication)    88 x 88 x 4              5022
Facebook [38]             (wall owner, poster, day)                   63891 x 63890 x 1847     737778
Taxi [?]                  (latitude, longitude, minute)               100 x 100 x 9617         17762489
DBLP [41]                 (paper, paper, view)                        7317 x 7317 x 3          274106
Netflix                   (movie, user, date)                         17770 x 252474 x 88      50244707
Amazon co-purchase [42]   (product, product, product group)           256 x 256 x 5            5726
Amazon metadata [42]      (product, customer, product group)          10000 x 263011 x 5       441301
Yelp                      (user, business, term)                      43872 x 11536 x 10000    10009860
Airport                   (airport, airport, airline)                 9135 x 9135 x 19305      58443

Figure 3: Error for AutoTen and the baselines, for noiseless synthetic
data.

5.1 Rank Structure of Real Datasets 

Since exploration of the rank structure of a dataset, using the
CORCONDIA diagnostic, is an integral part of AutoTen, we deem it
necessary to dive deeper into that process. In this case study we
analyze the rank structure of 10 real datasets, as captured
by CORCONDIA with Frobenius norm loss (using our algorithm
from [33]), as well as CORCONDIA with KL-Divergence loss
(introduced here). Most of the datasets we use are publicly available
and can be obtained by either following the link within the
original work that introduced them, or (whenever applicable) a direct
link. ENRON² is a social network dataset, recording the number
of emails exchanged between employees of the company for a
period of time, during the company crisis. Reality Mining [37]
is a multi-view social network dataset, recording relations between
MIT students (who calls whom, who messages whom, who is close
to whom and so on). Facebook [38] is a time-evolving snapshot
of Facebook, recording people posting on other people’s Walls.
Taxi³ is a dataset of taxi trajectories in Beijing; we discretize
latitude and longitude to a 100 x 100 grid. DBLP is a dataset
recording which researcher published which paper under three
different views (first view shows co-authorship, second view shows
citation, and third view shows whether two authors share at least
three keywords in the title or abstract of their papers). Netflix
comes from the Netflix prize dataset and records movie ratings by
users over time. Amazon co-purchase data records items bought
together, and the category of the first of the two products. Amazon
metadata records customers who reviewed a product, and the
corresponding product category. Yelp contains reviews of Yelp users
for various businesses (from the data challenge⁴). Finally, Airport⁵
contains records of flights between different airports, and the
operating airline.

² http://www.cs.emu.edu/~enron/
³ http://research.microsoft.com/apps/pubs/?id=152883

We ran our algorithms for F = 2 ... 50, and truncated negative 
values to zero. For KL-Divergence and datasets Facebook, 
Netflix, Yelp, and Airport we used smaller versions (first 
500 rows for Netflix and Yelp, and first 1000 rows for Facebook 
and Airport), due to high memory requirements of Matlab; this 
means that the corresponding figures describe the rank structure 
of a smaller dataset, which might be different from the full one. 
Figure 4 shows CORCONDIA when using Frobenius norm as a 
loss, and Fig. 5 when using KL-Divergence. The way to interpret 
these figures is the following: assuming a CP_NMU (Fig. 4) 
or a CP_APR (Fig. 5) model, each figure shows the modelling 
quality of the data for a given rank. This sheds light on the rank 
structure of a particular dataset (although that is not to say that 
it provides a definitive answer about its true rank). For the given 
datasets, we observe a few interesting differences in structure: for 
instance, ENRON and Taxi in Fig. [?] seem to have good quality for 
a few components. On the other hand, Reality Mining, DBLP, 
and Amazon metadata have reasonably acceptable quality for a 
larger range of components, with the quality decreasing as the 
number gets higher. Another interesting observation, confirming recent 
results in [?], is that Yelp seems to be modelled better using a 
high number of components. Figures that are all-zero merely show 
that no good structure was detected for up to 50 components; 
however, this might indicate that such datasets (e.g. Netflix) have 
an even higher number of components. Finally, contrasting Figs. 4 
and 5, we observe that in many cases using the KL-Divergence 
is able to discover better structure than the Frobenius norm (e.g. 
ENRON and Amazon co-purchase). 
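
To make the quantity plotted in these figures concrete, the following is a minimal NumPy sketch of the core consistency diagnostic under Frobenius norm loss, assuming its standard definition and a small dense three-mode tensor. The function name `corcondia` and the toy data are illustrative assumptions; this is neither the scalable algorithm of [33] nor the KL-Divergence variant introduced in this paper.

```python
import numpy as np

def corcondia(X, A, B, C):
    """Core consistency (in percent) of a 3-mode CP model X ~ [[A, B, C]]."""
    F = A.shape[1]
    # Least-squares Tucker core for fixed factors: G = X x1 A^+ x2 B^+ x3 C^+
    G = np.einsum("ijk,fi,gj,hk->fgh", X,
                  np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C))
    # Ideal core of a perfect CP model: ones on the superdiagonal, zeros elsewhere
    T = np.zeros((F, F, F))
    idx = np.arange(F)
    T[idx, idx, idx] = 1.0
    score = 100.0 * (1.0 - np.sum((G - T) ** 2) / F)
    # Negative values are truncated to zero, as in the experiments above
    return max(score, 0.0)

# Toy example: an exactly rank-3 tensor evaluated with its true factors
rng = np.random.default_rng(0)
A, B, C = (rng.random((n, 3)) for n in (6, 5, 4))
X = np.einsum("if,jf,kf->ijk", A, B, C)
score_true = corcondia(X, A, B, C)

# The same tensor evaluated with unrelated random factors
A2, B2, C2 = (rng.random((n, 3)) for n in (6, 5, 4))
score_wrong = corcondia(X, A2, B2, C2)
```

With the true factors the estimated core is the superdiagonal identity, so the score is (numerically) 100. In a sweep over F one would refit the decomposition for each candidate rank and plot the resulting scores, which is what each panel of the figures summarizes.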

5.2 AutoTen in practice 

We used AutoTen to analyze two of the datasets shown in Table [?]. 
In the following lines we show our results. 

5.2.1 Analyzing Taxi 

The data we have span an entire week worth of measurements, 
with temporal granularity of minutes. First, we tried quantizing the 
latitude and longitude into a 1000 x 1000 grid; however, AutoTen 
warned us that the decomposition was not able to detect good and 
coherent structure in the data, perhaps due to the extremely sparse 
variable space of our grid. Subsequently, we modelled the data 
using a 100 x 100 grid and AutoTen was able to detect good 
structure. In particular, AutoTen output 8 rank-one components, 
choosing Frobenius norm as a loss function. 
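
As a rough illustration of the discretization step described above, the sketch below bins GPS points into a (latitude, longitude, time) count tensor. The function name `build_count_tensor`, the synthetic coordinates, and the equal-width binning are assumptions made for illustration; the text does not specify how the grid was constructed. Grid sizes of 10 and 100 are used here only to keep the example small.

```python
import numpy as np

def build_count_tensor(lat, lon, minute, grid, n_minutes):
    """Count tensor of shape (grid, grid, n_minutes) from point observations."""
    def to_bin(v):
        v = np.asarray(v, dtype=float)
        span = v.max() - v.min()
        scaled = (v - v.min()) / span if span > 0 else np.zeros_like(v)
        # Equal-width bins; the maximum value is clipped into the last bin
        return np.minimum((scaled * grid).astype(int), grid - 1)

    X = np.zeros((grid, grid, n_minutes))
    # np.add.at accumulates repeated indices, unlike plain fancy-index assignment
    np.add.at(X, (to_bin(lat), to_bin(lon), np.asarray(minute, dtype=int)), 1)
    return X

# Synthetic stand-in for one hour of taxi GPS points (not the real data)
rng = np.random.default_rng(0)
n = 5000
lat = rng.normal(39.9, 0.05, n)
lon = rng.normal(116.4, 0.05, n)
minute = rng.integers(0, 60, n)

coarse = build_count_tensor(lat, lon, minute, grid=10, n_minutes=60)
fine = build_count_tensor(lat, lon, minute, grid=100, n_minutes=60)
density_coarse = np.count_nonzero(coarse) / coarse.size
density_fine = np.count_nonzero(fine) / fine.size
```

The finer grid spreads the same observations over many more cells, so the fraction of nonzero entries drops sharply. This is the kind of sparsity that may explain why the finer quantization yielded no coherent structure.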

In Figure 6 we show 4 representative components of the decomposition. 
In each sub-figure, we overlay the map of Beijing with 
the coordinates that appear to have high activity in the particular 
component; every sub-figure also shows the temporal profile of the 
component. The first two components (Fig. 6(a), (b)) spatially 
refer to a similar area, roughly corresponding to the tourist and 
business center in the central rings of the city. The difference is 
that Fig. 6(a) shows high activity during the weekdays and declining 
activity over the weekend (indicated by five peaks of equal 
height, followed by two smaller peaks), whereas Fig. 6(b) shows 
a slightly inverted temporal profile, where the activity peaks over 
the weekend; we conclude that Fig. 6(a) most likely captures business 
traffic that peaks during the week, whereas Fig. 6(b) captures 
tourist and leisure traffic that peaks over the weekend. The third 
component (Fig. 6(c)) is highly active around the Olympic Center 
and Convention Center area, with peaking activity in the middle 
of the week. Finally, the last component (Fig. 6(d)) shows high 
activity only outside of Beijing’s international airport, where taxis 
gather to pick up customers; on the temporal side, we see daily 
peaks of activity, with the collective activity dropping during the 
weekend, when there is significantly less traffic of people coming 
to the city for business. By being able to analyze such trajectory 
data into highly interpretable results, we can help policymakers 
better understand the traffic patterns of taxis in big cities, estimate 
high and low demand areas and times, and optimize city planning 
in that respect. There has been very recent work [40] in the 
same direction, and we view our results as complementary. 

Figure 4: CORCONDIA for CP_NMU 
Figure 5: CORCONDIA for CP_APR 

4 https://www.yelp.com/dataset_challenge/dataset 
5 http://openflights.org/data.html 

5.2.2 Analyzing Amazon co-purchase 

This dataset records pairs of products that were purchased together 
by the same customer on Amazon, as well as the category 
of the first product in the pair. This dataset, as shown in Figures 
4(g) and 5, does not have perfect trilinear structure; however, a low 
rank trilinear approximation still offers reasonably good insights 
for product recommendation and market basket analysis. By analyzing 
this dataset, we seek to find coherent groups of products that 
people tend to purchase together, aiming for better product 
recommendations and suggestions. For the purposes of this study, we 
extracted a small piece of the co-purchase network of 256 products. 
AutoTen was able to extract 24 components by choosing 
KL-Divergence as a loss. In Table 3 we show a representative subset 
of our resulting components (which were remarkably sparse, 
due to the KL-Divergence fitting by CP_APR). We observe that 
products of similar genre and themes tend to naturally cluster 
together. For instance, cluster #1 contains mostly self improvement 
books. We also observe a few topical outliers, such as the book How 
to Kill a Monster (Goosebumps) in cluster #1, and the CD 
Desde Que Samba E Samba in cluster #3, which contains 
Technical / Software Development books. 

(a) Tourist & Business Center: High activity during weekdays, low over the weekend 
(b) Downtown: Consistent activity over the week 
(c) Olympic Center: Activity peak during the week 
(d) Airport: High activity during weekdays, low over the weekend 

Figure 6: Latent components of the Taxi dataset, as extracted using AutoTen. 

Cluster type | Products | Product Types 
#1 Self Improvement | Resolving Conflicts At Work : A Complete Guide for Everyone on the Job | Book 
#1 Self Improvement | How to Kill a Monster (Goosebumps) | Book 
#1 Self Improvement | Mensa Visual Brainteasers | Book 
#1 Self Improvement | Learning in Overdrive: Designing Curriculum, Instruction, and Assessment from Standards : A Manual for Teachers | Book 
#2 Psychology, Self Improvement | Physicians of the Soul: The Psychologies of the World’s Greatest Spiritual Leaders | Book 
#2 Psychology, Self Improvement | The Rise of the Creative Class: And How It’s Transforming Work, Leisure, Community and Everyday Life | Book 
#3 Technical Books | Beginning ASP.NET Databases using C# | Book 
#3 Technical Books | BizPricer Business Valuation Manual wSoftware | Book 
#3 Technical Books | Desde Que Samba E Samba | Music 
#4 History | War at Sea: A Naval History of World War II | Book 
#4 History | Jailed for Freedom: American Women Win the Vote | Book 
#4 History | The Perfect Plan (7th Heaven) | Book 

Table 3: Latent components of the Amazon co-purchase dataset, as extracted using AutoTen 
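
One simple way to turn a decomposition into a cluster listing of this kind is to report, for each component, the items with the largest loadings in the product factor matrix. The sketch below assumes a nonnegative factor matrix `A` (products by components) and a list of product names; both are made-up placeholders, and the top-k rule is an illustrative choice rather than the procedure used in this study.

```python
import numpy as np

def top_items(A, names, k=3):
    """For each column of A, return the names of the k largest nonzero loadings."""
    clusters = []
    for f in range(A.shape[1]):
        # Stable sort on negated loadings gives descending order
        order = np.argsort(-A[:, f], kind="stable")[:k]
        clusters.append([names[i] for i in order if A[i, f] > 0])
    return clusters

# Hypothetical 5-product, 2-component factor matrix
names = ["book_a", "book_b", "book_c", "cd_d", "book_e"]
A = np.array([[0.9, 0.0],
              [0.7, 0.0],
              [0.0, 0.8],
              [0.0, 0.3],
              [0.1, 0.0]])
clusters = top_items(A, names, k=2)
# clusters == [["book_a", "book_b"], ["book_c", "cd_d"]]
```

Because sparse factors have few nonzero loadings per component, a listing produced this way stays short, and an off-topic item with a nonzero loading (here the hypothetical `cd_d`) shows up as an outlier within its cluster.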

6. RELATED WORK 

Tensors and their data mining applications. 

One of the first applications was on web mining, extending the 
popular HITS algorithm [?]. There has been work on analyzing 
citation networks (such as DBLP) [?], detecting anomalies in 
computer networks [?], extracting patterns from and completing 
Knowledge Bases [?], and analyzing time-evolving or multi-view 
social networks [?]. The long list of applications 
continues, with extensions of Latent Semantic Analysis [44, ?], 
extensions of Subspace Clustering to higher orders [45], Crime 
Forecasting [?], Image Processing [?], mining Brain data [48, ?], 
trajectory and mobility data [50, 40], and bioinformatics [?]. 


Choosing the right number of components. 

As we’ve mentioned throughout the text, CORCONDIA [25] 
uses properties of the PARAFAC decomposition in order to 
hint towards the right number of components. In [33], we introduce 
a scalable algorithm for CORCONDIA (under the Frobenius 
norm). Moving away from the PARAFAC decomposition, Kiers and 
Kinderen [?] introduce a method for choosing the number of 
components for Tucker3. There has been recent work using Minimum 
Description Length (MDL): in [22] the authors use MDL in the 
context of community detection in time-evolving social network 
tensors, whereas in [23], Metzler and Miettinen use MDL to score 
the quality of components for a binary tensor factorization. Finally, 
there have also been recent advances using Bayesian methods [24] 
in order to automatically decide the number of components. 

7. CONCLUSIONS 

In this paper, we work towards an automatic, unsupervised tensor 
mining algorithm that minimizes user intervention. Our main 
contributions are: 

• Algorithms We propose AutoTen, a novel automatic and 
unsupervised tensor mining algorithm, which can provide 
quality characterization of the solution. We extend the highly 
intuitive heuristic of [26] for KL-Divergence loss, providing 
an efficient algorithm. 

• Evaluation & Discovery We evaluate our methods on synthetic 
data, showing their superiority compared to the baselines, 
as well as on a wide variety of real datasets. Finally, we 
apply AutoTen to two real datasets, discovering meaningful 
patterns (Section 5.2). 